Artificial Intelligence in the Life Sciences

Elsevier BV

All preprints, ranked by how well they match Artificial Intelligence in the Life Sciences's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
DeepDrug2: A Germline-focused Graph Neural Network Framework for Alzheimer's Drug Repurposing Validated by Electronic Health Records

Li, V. O. K.; Han, Y.; Lam, J. C. K.; Downey, J.

2025-05-14 health informatics 10.1101/2025.05.13.25327570 medRxiv
Top 0.1%
9.3%

Alzheimer's disease (AD) is a complex neurodegenerative disorder with limited therapeutic options. The original DeepDrug framework by Li et al. (2025) relied on somatic mutation data and emphasized long genes to guide AD drug repurposing. However, emerging evidence suggests that germline genetic variants play a more central role in AD pathogenesis. In response, we develop DeepDrug2, an enhanced AI-driven framework for AD drug repurposing centered on germline mutations and validated using real-world electronic health records (EHRs). DeepDrug2 introduces four major innovations. First, it proposes a different hypothesis prioritizing germline over somatic mutations in influencing AD risk. Second, it updates the signed directed heterogeneous biomedical graph by removing somatic mutations, long genes, and expert-led genes from the previous version, and incorporating new genes identified in recent genome-wide association study (GWAS) findings. Third, it generates a new list of drug candidates by encoding this updated graph into a new embedding space via a graph neural network (GNN) and calculating drug-gene scores. Fourth, it performs real-world clinical validation using EHR data from over 500,000 individuals (including more than 4,000 AD cases) in the UK Biobank, evaluating associations between drug usage and AD onset while controlling for demographic and comorbidity factors. DeepDrug2 has identified several promising drug candidates. Among the top 15 candidates with sufficient medication records to support statistically powered analysis, Amlodipine (a calcium channel blocker), Indapamide (a thiazide-like diuretic), and Atorvastatin (a statin) were significantly associated with reduced AD risk (p < 0.05). These findings highlight the role of germline mutations in guiding AD drug repurposing and emphasize the value of integrating real-world clinical data into AI-driven drug discovery.
To further validate these candidates, future work will involve experimental studies using mouse and zebrafish models of AD. DeepDrug2 offers a promising strategy to support future clinical studies and expand therapeutic options for AD. Future work will evolve DeepDrug2 into a more powerful, versatile, and precise tool for AI-driven drug repurposing in neurodegenerative diseases by deeply integrating advanced LLM capabilities, prioritizing critical disease mechanisms, including tau pathology, and holistically incorporating multi-modal data sources.

2
Discovering Genetic Signatures Associated with Alzheimer's Disease in Tiled Whole Genome Sequence Data: Results from the Artificial Intelligence for Alzheimer's Disease (AI4AD) Consortium

Zaranek, S. W.; Zaranek, A. W.; Amstutz, P.; Bao, J.; Chen, J.; Clegg, T.; Craft, H.; Jo, T.; Lee, B.; Nho, K.; Thomopoulos, S. I.; Davatzikos, C.; Shen, L.; Huang, H.; Thompson, P. M.; Saykin, A. J.; The Alzheimer's Disease Neuroimaging Initiative as a consortium author for the AI4AD Initiative

2024-08-03 genetic and genomic medicine 10.1101/2024.08.01.24311329 medRxiv
Top 0.1%
8.4%

Currently, the ability to analyze large-scale whole genome sequence (WGS) data is limited due to both the size of the data and the inability of many existing tools to scale. To address this challenge, we use data "tiling" to efficiently partition whole genome sequences into smaller segments resulting in a simple numeric matrix of small integers. This lossless representation is particularly suitable for machine learning (ML) models. As an example of the benefits of tiling, we showcase results from tiled data as part of the Artificial Intelligence for Alzheimer's Disease (AI4AD) consortium. AI4AD is a coordinated initiative to develop transformative AI approaches for high throughput analysis of next generation sequencing and related imaging, AD biomarker, and cognitive data. The collective effort integrates imaging, genomic, biomarker, and cognitive data to address fundamental barriers in AD prevention and drug discovery. One of the project's initial aims is to discover new genetic signatures in WGS data that can be used to understand AD risk and progression in conjunction with imaging, biomarker and cognitive data. We tiled and analyzed 15,000+ genomes from the Alzheimer's Disease Sequencing Project (ADSP) and the Alzheimer's Disease Neuroimaging Initiative (ADNI). We tile 11,762 genomes, a subset of the release which does not include family-based datasets (AD Cases: 4,983, age range: 50-90 years, mean age: 73.8 years). We illustrate the use of tiled data in ML classification methods to predict phenotypes. Specifically, we identify and prioritize tile variants/genetic variants that are possible genetic signatures for AD. The model shows added predictive value from variants of genes previously found to be associated with AD risk, age of onset, neurofibrillary tangle measurements, and other AD-related traits, including the APOE variant (rs429358).

3
Moving Targets: Monitoring Target Trends in Drug Discovery by Mapping Targets, GO Terms, and Diseases

Zdrazil, B.; Richter, L.; Brown, N.; Guha, R.

2019-07-03 bioinformatics 10.1101/691550 medRxiv
Top 0.1%
6.9%

Drug Discovery is a lengthy and costly process and has faced a period of declining productivity within the last two decades. As a consequence, integrative data-driven approaches are nowadays on the rise in pharmaceutical research, making use of an inter-connected (network) view on diseases. In addition, evidence-based decisions are alleviated by studying the time evolution of innovation trends in drug discovery.

In this paper a new approach leveraging data mining and data integration for inspecting target innovation trends protein family-wise is presented. The study highlights protein families which are receiving emerging interest in the drug discovery community (mainly kinases and G protein coupled receptors) and those with areas of interest in target space that have just emerged in the scientific literature (mainly kinases and transporters) highlighting novel opportunities for drug intervention.

In order to delineate the evolution of target-driven research interest from a biological perspective, trends in biological process annotations from Gene Ontology (GO) and disease annotations from DisGeNET for major target families are captured. The analysis reveals an increasing interest in targets related to immune system processes, and a recurrent trend for targets involved in circulatory system processes. At the level of disease annotations, targets associated to e.g., cancer-related pathologies as well as to intellectual disability and schizophrenia are increasingly investigated nowadays.

Can this knowledge be used to study the "movement of targets" in a network view and unravel new links between diseases and biological processes? We tackled this question by creating dynamic network representations considering data from different time periods. The dynamic network for immune system process-associated targets suggests that, e.g., breast cancer as well as schizophrenia are linked to the same targets (cannabinoid receptor CB2 and VEGFR2) thus suggesting similar treatment options which could be confirmed by literature search. The methodology has the potential to identify other drug repurposing candidates and enables researchers to capture trends in research attention in target space at an early stage.

The KNIME workflows and R scripts used in this study are publicly available from https://github.com/BZdrazil/Moving_Targets.

Author summary: In this study we have investigated target innovation in drug discovery over a period of 22 years (1995-2016) by extracting time trends of research interest (as published in the scientific literature and stored in the ChEMBL database) in certain protein classes inspecting different measures (numbers of pharmacological measurements, targets, papers, and drugs). Focusing on the most relevant protein classes in drug discovery (G protein-coupled receptors, kinases, ion channels, nuclear receptors, proteases, and transporters), we further linked single targets to Gene Ontology (GO) biological process annotations and inspected steep increasing or decreasing trends of GO annotations within target families over time. We also tracked trends in disease annotations from DisGeNET by filtering out diseases linked to targets with emerging trends in research interest. Finally, targets, GO terms, and diseases are interconnected in network representations and shifts in research foci are investigated over time. This new methodology which utilizes data mapping and data analysis can be used to explore trends in research attention target family-wise, to uncover previously unknown links between diseases and biological processes and to identify potential candidates for drug repurposing.

4
Cell Painting-based bioactivity prediction boosts high-throughput screening hit-rates and compound diversity

Fredin Haslum, J.; Lardeau, C.-H.; Karlsson, J.; Turkki, R.; Leuchowius, K.-J.; Smith, K.; Mullers, E.

2023-04-05 bioinformatics 10.1101/2023.04.03.535328 medRxiv
Top 0.1%
6.4%

Efficiently identifying bioactive compounds towards a target of interest remains a time- and resource-intensive task in early drug discovery. The ability to accurately predict bioactivity using morphological profiles has the potential to rationalize the process, enabling smaller screens of focused compound sets. Towards this goal, we explored the application of deep learning with Cell Painting, a high-content image-based assay, for compound bioactivity prediction in early drug screening. Combining Cell Painting data and unrefined single-concentration activity readouts from high-throughput screening (HTS) assays, we investigated to what degree morphological profiles could predict compound activity across a set of 140 unique assays. We evaluated the performance of our models across different target classes, assay technologies, and disease areas. The predictive performance of the models was high, with a tendency for better predictions on cell-based assays and kinase targets. The average ROC-AUC was 0.744 with 62% of assays reaching ≥0.7, 30% reaching ≥0.8 and 7% reaching ≥0.9 average ROC-AUC, outperforming commonly used structure-based predictions in terms of predictive performance and compound structure diversity. In many cases, bioactivity prediction from Cell Painting data could be matched using brightfield images rather than multichannel fluorescence images. Experimental validation of our predictions in follow-up assays confirmed enrichment of active compounds. Our results suggest that models trained on Cell Painting data can predict compound activity in a range of high-throughput screening assays robustly, even with relatively noisy HTS assay data. With our approach, enriched screening sets with higher hit rates and higher hit diversity can be selected, which could reduce the size of HTS campaigns and enable primary screening with more complex assays.

5
Network Based Identification of Holistic Drug Target for Parkinson Disease and Deep Learning assisted Drug Repurposing.

Raza, A.; Muddassar, M.

2022-11-20 bioinformatics 10.1101/2022.11.18.515243 medRxiv
Top 0.1%
6.3%

Parkinson disease is a neurodegenerative disorder of the nervous system that disrupts the motor activity of the body. The current understanding of the disorder's pathogenesis is incomplete, resulting in widespread use of exogenous medical treatments targeting the dopamine quantity and posing a major challenge for appropriate drug development. The plethora of high throughput techniques in the last decade has yielded a vast number of omics datasets, offering an opportunity to provide a holistic overview of the disease workings and dynamics. We integrated the Parkinson disease omics datasets using network-based integration strategies to build a Parkinson disease network. The most impactful and resilient node of the network was selected as a drug target. A deep learning-based virtual screening estimator was built from physicochemical properties of different compounds having variable affinity to target binding. Virtual screening of FDA-approved drugs repurposed 19 drugs, with 25% of them falling under insomnia treatment, insomnia being the most prevalent sleep disorder in Parkinson patients. The source code of the project is available at https://github.com/aysanraza/pd_repurposing_protocol

6
Adera: A drug repurposing workflow for neuro-immunological investigations using neural networks.

Lazarczyk, M.; Mickael, M. E.; Sacharczuk, M.

2022-07-14 bioinformatics 10.1101/2022.07.14.500072 medRxiv
Top 0.1%
6.3%

Drug repurposing in the context of neuro-immunological (NI) investigations is still in its early stages. Drug repurposing is an important method that bypasses lengthy drug discovery procedures and instead focuses on discovering new uses for known medications. Neuro-immunological diseases such as Alzheimer's, Parkinson's, multiple sclerosis and depression include various pathologies that result from the interaction between the central nervous system and the immune system. However, repurposing of medications is hindered by the vast amount of information that needs mining. To address the need for repurposing known medications for neuro-immunological diseases, we built a deep neural network named Adera to perform drug repurposing. The model uses two deep learning networks. The first network is an encoder and its main task is to embed text into matrices. For the second network, we explored the use of two different loss functions, binary cross entropy and mean squared error (MSE). Furthermore, we investigated the effect of ten different network architectures with each loss function. Our results show that for the binary cross entropy loss function, the best architecture consists of two convolutional neural network layers and achieves a loss of less than 0.001. In the case of the MSE loss function, a shallow network using a ReLU activation achieved an accuracy of over 98% and a loss of 0.001. Additionally, Adera was able to predict various drug repurposing targets in agreement with the Drug Repurposing Hub. These results establish the ability of Adera to repurpose with high accuracy drug candidates that can shorten the drug development cycle. The software can be downloaded from https://github.com/michel-phylo/ADERA1.

7
INSIGHT: In Silico Drug Screening Platform using Interpretable Deep Learning Network

Shi, X.; Ramathal, C.; Dezso, Z.

2025-04-24 bioinformatics 10.1101/2025.04.21.649855 medRxiv
Top 0.1%
6.3%

Large-scale multiplexed drug screening platforms like PRISM and GDSC facilitate the screening of drug treatments over 1,000 cancer cell lines. The cancer cell lines are well characterized by multiomics screening in CCLE and DepMap, enabling the application of AI and machine learning techniques to study the association between drug sensitivity and the underlying molecular profiles. The large scale and variety of data modalities enabled us to build an interpretable deep learning framework, INSIGHT, integrating the multiomics data and the drug's molecular structure to predict drug response. We trained our model on the PRISM screen for single treatments and on the DrugComb screen database for combination treatments. Our method enables the in-silico extension of current screens by predicting drug response in cancer cell lines not included in the screen, as well as the drug response to novel single agent or combination therapy by leveraging the drug's molecular structure. Furthermore, the deep learning framework was built to enable biological interpretation. The connections between the hidden layers of the neural network incorporate prior biological knowledge such as signaling pathways. This enables the model, in addition to predicting the drug sensitivity profiles, to prioritize the pathways predictive of drug response and identify pathways related to the mechanism of action (MOA) and potential off target effects of novel drugs. The evaluation of our model using cross-validation on the PRISM and DrugComb datasets showed an improved performance compared to previously developed biologically informed deep learning methods and traditional state of the art machine learning methods like elastic-net and XGBoost. We illustrated with examples the value of incorporating biological knowledge into INSIGHT by relating the pathway activity of the predictive models to the MOA.

8
Understanding the Sources of Performance in Deep Learning Drug Response Prediction Models

Branson, N.; Cutillas, P. R.; Bessant, C.

2024-06-06 bioinformatics 10.1101/2024.06.05.597337 medRxiv
Top 0.1%
5.0%

Anti-cancer drug response prediction (DRP) using cancer cell lines plays a vital role in stratified medicine and drug discovery. Recently there has been a surge of new deep learning (DL) models for DRP that improve on the performance of their predecessors. However, different models use different input data types and neural network architectures, making it hard to find the source of these improvements. Here we consider multiple published DRP models that report state-of-the-art performance in predicting continuous drug response values. These models take the chemical structures of drugs and omics profiles of cell lines as input. By experimenting with these models and comparing with our own simple benchmarks we show that no performance comes from drug features; instead, performance is due to the transcriptomics cell line profiles. Furthermore, we show that, depending on the testing type, much of the current reported performance is a property of the training target values. To address these limitations we create novel models (BinaryET and BinaryCB) that predict binary drug response values, guided by the hypothesis that this reduces the noise in the drug efficacy data, thus better aligning them with biochemistry that can be learnt from the input data. BinaryCB leverages a chemical foundation model, while BinaryET is trained from scratch using a transformer-type model. We show that these models learn useful chemical drug features, which, to our knowledge, is the first time this has been demonstrated for multiple DRP testing types. We further show that binarising the drug response values is what causes the models to learn useful chemical drug features. We also show that BinaryET improves performance over BinaryCB, and over the published models that report state-of-the-art performance.

9
Long-term innovative potential of genetic research and its suppression

Chae, J.; Kim, W.; Jung, W.; Jeong, D.; Lim, R.; Chamlagain, M.; Jung, G.; Jang, J.; Lee, J. W.; Kang, N. K.; Baek, K.; Shin, J.; Lee, Y.-G.; Koh, H. G.; Kim, C.; Yook, S.; Cheung, A. K. L.; Jin, Y.-S.; Youn, H.; Kim, P.-J.; Ghim, C.-M.

2025-02-18 scientific communication and education 10.1101/2025.02.17.638429 medRxiv
Top 0.1%
5.0%

Current technological revolutions, involving artificial intelligence, mRNA vaccines, and quantum computing, are largely driven by industry. Despite the existing perception that commercial motives promote cutting-edge innovation, concerns may arise about their risk of limiting scientific exploration from diverse perspectives, which nurtures long-term innovative potential. Here, we investigate the interplay between scientific exploration and industrial influence by analyzing about 20 million papers and US, Chinese, and European patents in genetic research, a domain of far-reaching societal importance. We observe that research on new genes has declined since the early 2000s, but the exploration of novel gene combinations still underpins biotechnology innovation. Fields of highly practical or commercial focus are less likely to adopt the innovative approaches, exhibiting lower research vitality. Additionally, continuous scientific research creates exploratory opportunities for innovation, while industry's R&D efforts are typically short-lived. Alarmingly, up to 42.2-74.4% of these exploratory opportunities could be lost if scientific research is restrained by industry interests, highlighting the cost of over-reliance on commercially-driven research. Given industry's dominance in recent technologies, our work calls for a balanced approach with long-term scientific exploration to preserve innovation vitality, unlock the full potential of genetic research and biotechnology, and address complex global challenges.

10
A machine-learning evaluation of biomarkers designed for the future of precision medicine

Climer, S.

2023-07-12 health informatics 10.1101/2023.07.09.23292430 medRxiv
Top 0.1%
4.9%

Precision medicine is cognizant of the impact of genetics and environments on subtypes of heterogeneous diseases and aims to identify, diagnose, and treat each subtype appropriately. Real-valued biomarkers, such as protein levels in plasma, are key for practical subtype diagnoses and hold potential to elucidate subtypes and illuminate promising drug targets. Biomarkers that are common across all subtypes have been discovered using fold change (FC) and the area under the receiver operating characteristic curve (AUC). However, FC and AUC fail to identify biomarkers for subtypes when they comprise less than half of the disease group. We present here a machine-learning biomarker evaluation method based on clustering of the data points, referred to as Difference in Bicluster Distances (DBD). We contribute efficient, yet optimal, software coupled with rigorous validation techniques, and demonstrate our approach on a late-onset Alzheimer disease (AD) gene expression dataset. Our trials produced four significant genes and appropriate thresholds for biomarker diagnostics. While none of these genes were identified as significant by either FC or AUC for the given dataset, the genes have been independently associated with AD or neurological disorders by other groups using completely independent means. In summary, DBD provides a unique and effective method for screening real-valued data to identify biomarkers associated with subtypes of heterogeneous diseases.

11
ImmunAL: a frame to identify the immunological markers for Mild Moderated Alzheimer's Disease applying Multiplex Network Model

Sen, S.; Chatterje, A.; Maulik, U.

2021-10-19 bioinformatics 10.1101/2021.10.18.464796 medRxiv
Top 0.1%
4.9%

Identification of immunological markers for neurodegenerative diseases helps resolve issues related to diagnostics and therapeutics. Neuro-specific cells experience disruptive mechanisms in the early stages of disease progression. The autophagy mechanism, guided by the autoantibodies, is one of the prime indicators of neurodegenerative diseases. Identifying autoantibodies can show a new direction. Detecting influential autoantibodies from relational networks, viz. co-expression, co-methylation, etc., is a well-studied area. However, none of the studies have considered the functional affinity among the autoantibodies while selecting them from a relational network. In this regard, a two-layered multiplex network-based framework has been proposed, whereby the layers consist of co-expression and co-semantic scores. The networks have been formed using three distinct cases, viz. diseased, controlled, and a combination of both. Subsequently, a random walk with restart mechanism has been applied to identify the influential autoantibodies, where the layer switching probability and restart probability are 0.5 and 0.4 respectively. Next, a pathway semantic network has been formed considering the autoantibody-associated pathways. EPO and IL1RN, associated with a maximum number of pathways, are identified as the two most influential autoantibodies. The network also provides insights into possible molecular mechanisms during the pathogenic progression. Finally, MDPI and CNN3 are also identified as important biomarkers. Availability: The code is available at https://github.com/agneet42/ImmunAL
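
As a rough illustration of the random walk with restart step mentioned in the abstract above, the sketch below runs the walk on a two-layer multiplex network using the stated layer switching probability (0.5) and restart probability (0.4). This is not the authors' code: the function name `rwr_multiplex`, the toy adjacency matrices, and the choice to sum per-layer probabilities into one node score are all assumptions made for the example, and it follows one common formulation in which a walker either moves within its current layer or jumps to the same node in the other layer.

```python
def rwr_multiplex(layers, seeds, restart=0.4, switch=0.5, tol=1e-12, max_iter=10000):
    """Random walk with restart on a multiplex network (needs >= 2 layers).

    layers: list of square adjacency matrices (lists of lists) over the same nodes.
    seeds:  list of distinct seed node indices.
    Returns one score per node (probabilities summed over layers).
    """
    n_layers, n = len(layers), len(layers[0])
    size = n_layers * n  # one state per (layer, node) pair
    # trans[target][source] is a column-stochastic transition matrix.
    trans = [[0.0] * size for _ in range(size)]
    for k, adj in enumerate(layers):
        for j in range(n):
            src = k * n + j
            degree = sum(adj[i][j] for i in range(n))
            # An isolated node in this layer can only switch layers.
            stay = (1.0 - switch) if degree > 0 else 0.0
            if degree > 0:
                for i in range(n):
                    trans[k * n + i][src] = stay * adj[i][j] / degree
            for other in range(n_layers):
                if other != k:
                    trans[other * n + j][src] = (1.0 - stay) / (n_layers - 1)
    # Restart distribution: uniform over seed nodes in every layer.
    p0 = [0.0] * size
    for k in range(n_layers):
        for s in seeds:
            p0[k * n + s] = 1.0 / (n_layers * len(seeds))
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(trans[t][s] * p[s] for s in range(size))
            + restart * p0[t]
            for t in range(size)
        ]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return [sum(p[k * n + i] for k in range(n_layers)) for i in range(n)]


# Toy 4-node example (placeholder data, not taken from the preprint).
co_expression = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
co_semantic = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
scores = rwr_multiplex([co_expression, co_semantic], seeds=[0])
print(scores)
```

Nodes with higher scores are visited more often by a walker that keeps returning to the seed, which is the sense in which such a walk ranks "influential" nodes relative to the seed set.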

12
Insights into Drug Cardiotoxicity from Biological and Chemical Data: The First Public Classifiers for FDA DICTrank

Seal, S.; Spjuth, O.; Hosseini-Gerami, L.; Garcia-Ortegon, M.; Singh, S.; Bender, A.; Carpenter, A. E.

2023-10-18 bioinformatics 10.1101/2023.10.15.562398 medRxiv
Top 0.1%
4.9%

Drug-induced cardiotoxicity (DICT) is a major concern in drug development, accounting for 10-14% of postmarket withdrawals. In this study, we explored the capabilities of various chemical and biological data to predict cardiotoxicity, using the recently released Drug-Induced Cardiotoxicity Rank (DICTrank) dataset from the United States FDA. We analyzed a diverse set of data sources, including physicochemical properties, annotated mechanisms of action (MOA), Cell Painting, Gene Expression, and more, to identify indications of cardiotoxicity. We found that such data, including protein targets, especially those related to ion channels (such as hERG), physicochemical properties (such as electrotopological state) as well as peak concentration in plasma offer strong predictive ability as well as valuable insights into DICT. We also found compounds annotated with particular mechanisms of action, such as cyclooxygenase inhibition, could distinguish between most-concern and no-concern DICT compounds. Cell Painting features related to ER stress discern the most-concern cardiotoxic compounds from non-toxic compounds. While models based on physicochemical properties currently provide substantial predictive accuracy (AUCPR = 0.93), this study also underscores the potential benefits of incorporating more comprehensive biological data in future DICT predictive models. With the availability of -omics data in the future, using biological data promises enhanced predictability and delivers deeper mechanistic insights, paving the way for safer therapeutic drug development. All models and data used in this study are publicly released at https://broad.io/DICTrank_Predictor

13
Large Language Model-Driven Prioritization of Alzheimer's Disease Drug Targets Across Multidimensional Criteria

Adaszewski, S.; Schindler, T.

2025-12-29 health informatics 10.64898/2025.12.28.25343106 medRxiv
Top 0.1%
4.9%

Large language models (LLMs) offer new opportunities to synthesize the vast and heterogeneous biomedical literature, yet their potential to support drug target prioritization in complex diseases such as Alzheimer's disease (AD) remains largely unexplored. Here, we introduce an LLM-driven framework that evaluates and ranks AD therapeutic targets across six criteria central to pharmaceutical decision-making: biological confidence, technical feasibility, clinical developability, patient impact, competitive landscape, and safety assessment. Using Gemini 2.5 Pro augmented with real-time web search, we performed large-scale pairwise comparative evaluations and pointwise scoring across a focused set of 522 AD-associated targets with high-quality chemical probes--a tractable subset enriched for clinically advanced targets. We implemented a novel pairwise QuickSort-based ranking procedure that leverages the LLM as a comparative oracle, and benchmarked its performance against pointwise scoring across 16 replicate runs per criterion. Retrieval-augmented LLM reasoning substantially improved early enrichment of clinically validated AD targets, outperforming LLM-only prompting and approaching the performance of the OpenTargets association benchmark. Pairwise comparative reasoning consistently exceeded pointwise scoring across five of six criteria, yielding higher stability, stronger inter-criterion structure, and markedly improved normalized gain metrics. Multi-objective integration using Pareto fronts and utopia-point scoring further enhanced consensus and robustness, producing holistic rankings that nearly matched the strongest individual criteria while exhibiting superior cross-category coherence. Challenges remained in assessing competitiveness and safety--domains with sparse or inconsistent literature representation--highlighting areas where hybrid models integrating structured datasets may be required.
Together, these results demonstrate that retrieval-augmented LLMs, when combined with structured comparative prompting and multi-criteria integration, can approximate expert-level reasoning and meaningfully enrich target prioritization pipelines for AD. This framework provides a scalable, interpretable, and biologically grounded approach for early-stage drug discovery, with broad applicability to other complex diseases.
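
The idea of a QuickSort-based ranking driven by a pairwise comparative oracle, as described in the abstract above, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `quicksort_rank` and `prefer` are hypothetical names, and the oracle here is a plain lookup over placeholder scores standing in for what would, in the preprint's setting, be an LLM call comparing two targets.

```python
import random


def quicksort_rank(items, prefer, rng=None):
    """Order items best-first using only pairwise comparisons.

    prefer(a, b) must return True when a should rank above b.
    In an LLM-driven setting this callable would wrap a model query;
    here it is any ordinary Python function.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so pivot choices are reproducible
    if len(items) <= 1:
        return list(items)
    pivot_index = rng.randrange(len(items))
    pivot = items[pivot_index]
    rest = items[:pivot_index] + items[pivot_index + 1:]
    better, worse = [], []
    for item in rest:
        # One oracle call per (item, pivot) pair.
        if prefer(item, pivot):
            better.append(item)
        else:
            worse.append(item)
    return (
        quicksort_rank(better, prefer, rng)
        + [pivot]
        + quicksort_rank(worse, prefer, rng)
    )


# Placeholder scores used only to simulate a consistent oracle.
toy_scores = {"target_A": 0.2, "target_B": 0.9, "target_C": 0.5, "target_D": 0.7}
ranking = quicksort_rank(list(toy_scores), lambda a, b: toy_scores[a] > toy_scores[b])
print(ranking)
```

With a consistent oracle this needs on the order of n log n comparisons on average rather than all n(n-1)/2 pairs, which matters when each comparison is an expensive model query. A real LLM oracle can be noisy or intransitive, which is presumably one reason the abstract reports replicate runs per criterion.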

14
The Gene Expression Landscape of Disease Genes

Garcia-Gonzalez, J.; Garcia-Gonzalez, S.; Liou, L.; O'Reilly, P. F.

2024-06-21 genetic and genomic medicine 10.1101/2024.06.20.24309121 medRxiv
Top 0.1%
4.8%

Fine-mapping and gene-prioritisation techniques applied to the latest Genome-Wide Association Study (GWAS) results have prioritised hundreds of genes as causally associated with disease. Here we leverage these recently compiled lists of high-confidence causal genes to interrogate where in the body disease genes operate, which, in previous studies, has mostly been investigated by testing for enrichment of GWAS signal among genes with cell/tissue specific expression. By integrating GWAS summary statistics, gene prioritisation results, and RNA-seq data from 46 tissues and 204 cell types, we directly analyse the gene expression of putative disease genes across the body in relation to 11 major diseases and cancers. In tissues and cell types with established disease relevance, disease genes show higher and more specific gene expression compared to control genes. However, we also detect elevated expression in tissues and cell types without previous links to the corresponding disease. While some of these results may be explained by cell types that span multiple tissues, such as macrophages in brain, blood, lung and spleen in relation to Alzheimer's disease (P-values < 10^-3), the cause for others is unclear and warrants further investigation. To support functional follow-up studies of disease genes, we identify technical and biological factors influencing their expression, and highlight tissues in which higher expression is associated with increased odds of inclusion in drug development programs. We provide our systematic testing framework as an open-source, publicly available tool that can be utilised to offer novel insights into the genes, tissues and cell types involved in any disease, with the potential for informing drug development and delivery strategies.

15
Benchmarking ensemble docking methods as a scientific outreach project

Gan, J. L.; Kumar, D.; Chen, C.; Taylor, B. C.; Jagger, B. R.; Amaro, R. E.; Lee, C. T.

2020-10-04 scientific communication and education 10.1101/2020.10.02.324343 medRxiv
Top 0.1%
4.3%

The discovery of new drugs is a time-consuming and expensive process. Methods such as virtual screening, which can filter out ineffective compounds from drug libraries prior to expensive experimental study, have become popular research topics. As the computational drug discovery community has grown, in order to benchmark the various advances in methodology, organizations such as the Drug Design Data Resource have begun hosting blinded grand challenges seeking to identify the best methods for ligand pose-prediction, ligand affinity ranking, and free energy calculations. Such open challenges offer a unique opportunity for researchers to partner with junior students (e.g., high school and undergraduate) to validate basic yet fundamental hypotheses considered to be uninteresting to domain experts. Here, we, a group of high school-aged students and their mentors, present the results of our participation in Grand Challenge 4, where we predicted ligand affinity rankings for the Cathepsin S (CatS) protease, an important protein target for autoimmune diseases. To investigate the effect of incorporating receptor dynamics on ligand affinity rankings, we employed the Relaxed Complex Scheme, a molecular docking method paired with molecular dynamics-generated receptor conformations. We found that CatS is a difficult target for molecular docking, and we explore some advanced methods such as distance-restrained docking to try to improve the correlation with experiments. This project exemplifies the capabilities of high school students when supported with a rigorous curriculum, and demonstrates the value of community-driven competitions for beginners in computational drug discovery.
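
To illustrate what "correlation with experiments" can mean for an affinity ranking, here is a minimal sketch of Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant). The abstract does not state which correlation measure was used, so this choice is an assumption, and the scores below are made up.

```python
from itertools import combinations


def kendall_tau(predicted, experimental):
    """Kendall's tau-a between two equally long lists of numbers."""
    if len(predicted) != len(experimental):
        raise ValueError("inputs must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two ligands")
    concordant = discordant = 0
    for (p1, e1), (p2, e2) in combinations(zip(predicted, experimental), 2):
        sign = (p1 - p2) * (e1 - e2)
        if sign > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif sign < 0:
            discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) // 2)


# Hypothetical docking scores and measured affinities for five ligands
# (lower is better in both lists).
docking_scores = [-9.1, -8.4, -7.9, -8.8, -6.5]
measured = [-10.2, -8.0, -8.3, -9.5, -7.1]
print(kendall_tau(docking_scores, measured))
```

A value near 1 means the predicted ranking agrees with experiment, a value near 0 means no better than chance, and negative values indicate an inverted ranking.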

16
Anticancer Target Combinations: Network-Informed Signaling-Based Approach to Discovery

Yavuz, B. R.; Jang, H.; Nussinov, R.

2024-10-15 bioinformatics 10.1101/2024.10.11.617918 medRxiv
Top 0.1%
4.3%

While anticancer drug discovery has seen dramatic innovations and successes, sequential single therapies are time-limited by resistance, and combinatorial strategies have been lagging. The number of possible drug combinations is vast. To select drug combinations, the oncologist requires knowledge of the optimal combination of proteins to co-target. Currently, combinations that the oncologist considers are primarily from empirical observations and clinical praxis. Our aim is to develop a signaling-based method to discover optimal proteins for the oncologist to co-target with drug combinations, and test it on available, patient-derived data. To temper the expected resistance to single-drug regimens, we offer a concept-based stratified pipeline aimed at selecting co-targets for drug combinations. Our strategy is unique in its co-target selection being based on signaling pathways. This is significant since in cancer, drug resistance commonly bypasses blocked proteins by wielding alternative, or complementary, routes to execute cell proliferation. Our network-informed signaling-based approach harnesses advanced network concepts and metrics, and our compiled, tissue-specific co-existing mutations. Co-existing driver mutations are common in resistance. Thus, to mimic cancer and counter drug resistance scenarios, our pipeline seeks co-targets that, when targeted by drug combinations, can shut off cancer's modus operandi. That is, its parallel or complementary signaling pathways would be blocked. Rotating through combinations could further lessen emerging resistance. We applied it to patient-derived breast and colorectal ESR1|PIK3CA and BRAF|PIK3CA subnetworks. Consistently, in breast cancer, our results suggest co-targeting proteins from the ESR1|PIK3CA subnetwork with an alpelisib-LJM716 combination. In colorectal cancer, they suggest co-targeting BRAF|PIK3CA with an alpelisib, cetuximab, and encorafenib combination. Collectively, our pipeline's results are promising, and validated by patient-based xenografts.

17
Accurate PROTAC targeted degradation prediction with DegradeMaster

Liu, J.; Roy, M.; Isbel, L.; Li, F.

2025-02-07 bioinformatics 10.1101/2025.02.03.636343 medRxiv
Top 0.1%
4.3%

Motivation: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade undruggable proteins of interest (POIs) by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs by disregarding crucial 3D spatial information or exclusively rely on limited labels for supervised learning without considering the abundant information from unlabeled data. Given the potential to accelerate drug discovery, it is critical to develop more accurate computational methods for PROTAC-targeted protein degradation prediction. Results: This study proposes DegradeMaster, a semi-supervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilises a memory-based pseudo-labeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semi-supervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, with substantial improvement of AUROC by 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecule graph demonstrates that DegradeMaster recognises linking and binding regions of warhead and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation. Availability: The source code and datasets are available at https://github.com/Jackson117/DegradeMaster and https://zenodo.org/records/14715718.

18
Clinical Advancement Forecasting

Czech, E. A.; Wojdyla, R. S.; Himmelstein, D. S.; Frank, D. H.; Miller, N. A.; Milwid, J. M.; Kolom, A.; Hammerbacher, J.

2024-08-03 genetic and genomic medicine 10.1101/2024.08.02.24311422 medRxiv
Top 0.1%
4.2%

Choosing which drug targets to pursue for a given disease is one of the most impactful decisions made in the global development of new medicines. This study examines the extent to which the outcomes of clinical trials can be predicted based on a small set of longitudinal (temporally labeled) evidence and properties of drug targets and diseases. We demonstrate a novel statistical learning framework for identifying the top 2% of target-disease pairs that are as much as 4-5x more likely to advance beyond phase 2 trials. This framework is 1.5-2x more effective than an Open Targets composite score based on the same set of evidence. It is also 2x more effective than a common measure for genetic support that has been observed previously, as well as in this study, to confer a 2x higher likelihood of success. Utilizing a subset of our biomedical evidence base, non-negative linear models resulting from this framework can produce simple weighting schemes across various types of human, animal, and cell model genomic, transcriptomic, proteomic, and clinical evidence to identify previously undeveloped target-disease pairs poised for clinical success. In this study we further explore: i) how longitudinal treatment of evidence relates to leakage and reverse causality in biomedical research and how temporalized evidence can mitigate common forms of potential biases and inflation; ii) the relative impact of different types of features on our predictions; and iii) an analysis of the space of currently undeveloped, tractable targets predicted with these methods to have the highest likelihood of clinical success. To ease reproduction and deployment, no data is used outside of Open Targets, and the described methods require no expert knowledge and can support expansion of lines of evidence to further improve performance.
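
One way to read a statement such as "the top 2% of pairs are 4-5x more likely to advance" is as a relative success rate: the advancement rate among the highest-scoring pairs divided by the rate across all pairs. The sketch below computes that ratio. The abstract does not give the exact definition, so the baseline, the function name, and the toy data are all assumptions.

```python
def top_fraction_relative_success(scores, advanced, fraction=0.02):
    """Advancement rate in the top-scoring fraction relative to the base rate.

    scores   -- model score per target-disease pair (higher = more promising)
    advanced -- 1 if the pair advanced beyond the phase of interest, else 0
    """
    if len(scores) != len(advanced) or not scores:
        raise ValueError("scores and advanced must be non-empty and equal length")
    base_rate = sum(advanced) / len(advanced)
    if base_rate == 0:
        raise ValueError("no advancing pairs; relative success is undefined")
    ranked = sorted(zip(scores, advanced), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))  # always keep at least one pair
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    return top_rate / base_rate


# Toy data: 100 pairs, 10 of which advanced, including the two best-scored.
toy_scores = list(range(100))
toy_advanced = [1 if s >= 98 or s < 8 else 0 for s in toy_scores]
print(top_fraction_relative_success(toy_scores, toy_advanced))
```

For a time-aware evaluation of the kind the abstract describes, the scores would need to be computed only from evidence available before the outcome was observed.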

19
Small molecule bioactivity benchmarks are often well-predicted by counting cells

Seal, S.; Dee, W.; Shah, A.; Zhang, A.; Titterton, K.; Cabrera, A. A.; Boiko, D.; Beatson, A.; Puigvert, J. C.; Singh, S.; Spjuth, O.; Bender, A.; Carpenter, A. E.

2025-04-30 bioinformatics 10.1101/2025.04.27.650853 medRxiv
Top 0.1%
4.1%

Phenotypic profiling methods, such as Cell Painting and gene expression, have been widely used to predict compound bioactivity, often showing improvement over predictive models based on chemical structures alone. We discovered that a large subset of assays in widely used benchmark datasets either directly relate to cell health and cytotoxicity or are assays intending to capture a more specific phenotype but whose active compounds impact cell count, while inactives do not. As a result, counting cells can achieve predictive performance similar to that of Cell Painting or gene expression data. Filtering benchmarks to include only assays relating to protein targets reveals that Cell Painting can capture information that cannot be predicted by mere cell counting. We re-evaluated three benchmark datasets used with Cell Painting data and observed that, in many cases, cell count models produced an AUC comparable to models using the full Cell Painting profiles. However, in protein-target-specific benchmarks across 17 distinct protein targets, Cell Painting features demonstrated unique predictive power, outperforming cell count models with a relative improvement of 19.6% in mean balanced accuracy. We propose five practical recommendations for benchmarking machine learning models for predicting bioactivity, including using cell count as a baseline feature. Although multi-class classification applications (such as matching samples based on their morphological profile) are less likely to be predictable by cell count than bioactivity benchmarks, these recommendations are broadly applicable to machine learning for drug discovery.
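
The comparison above rests on two simple quantities: balanced accuracy and a relative improvement over a baseline. The sketch below shows how each can be computed for a binary active/inactive assay. The function names and labels are illustrative and not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def relative_improvement(model_score, baseline_score):
    """Fractional gain of a model over a baseline, e.g. 0.196 for 19.6%."""
    return (model_score - baseline_score) / baseline_score


# Hypothetical labels and predictions for one assay.
labels = [1, 1, 0, 0, 0, 0]
cell_count_predictions = [1, 0, 0, 0, 0, 1]
print(balanced_accuracy(labels, cell_count_predictions))
```

Reporting the cell count model's score alongside the full-profile model's score makes it visible when a benchmark is largely a cytotoxicity readout.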

20
Unsupervised co-optimization of a graph neural network and a knowledge graph embedding model to prioritize causal genes for Alzheimer's Disease

Liu, K.; Prabhakar, V.

2022-10-06 health informatics 10.1101/2022.10.03.22280657 medRxiv
Top 0.1%
4.0%

Data obtained from clinical trials for a given disease often capture reliable empirical features of the highest quality, but are limited to a few studies or experiments. In contrast, knowledge data extracted from biomedical literature captures a wide range of clinical information relevant to a given disease that may not be as reliable as the experimental data. Therefore, we propose a novel method of training that co-optimizes two AI algorithms on experimental data and knowledge-based information from literature, respectively, to supplement the learning of one algorithm with that of the other, and apply this method to prioritize/rank causal genes for Alzheimer's Disease (AD). One algorithm generates unsupervised embeddings for gene nodes in a protein-protein interaction network associated with experimental data. The other algorithm generates embeddings for the nodes/entities in a knowledge graph constructed from biomedical literature. Both these algorithms are co-optimized to leverage information from each other's domain. Therefore, a downstream inferencing task to rank causal genes for AD ensures the consideration of experimental and literature data available to implicate any given gene in the geneset. Rank-based evaluation metrics computed to validate the gene rankings prioritized by our algorithm showed that the top-ranked positions were highly enriched with genes from a ground truth set that were experimentally verified to be causal for the progression of AD.
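
The abstract does not name the rank-based metrics it used. As one illustration of how enrichment of ground-truth genes among top-ranked positions can be quantified, the sketch below counts ground-truth hits in the top k and computes a hypergeometric tail probability. Both the choice of test and the gene symbols are assumptions for the example.

```python
from math import comb


def top_k_hits(ranked_genes, truth_genes, k):
    """Number of ground-truth genes among the first k ranked genes."""
    truth = set(truth_genes)
    return sum(1 for gene in ranked_genes[:k] if gene in truth)


def enrichment_p_value(n_total, n_truth, k, hits):
    """P(X >= hits) when k genes are drawn at random from n_total genes,
    n_truth of which belong to the ground-truth set (hypergeometric tail)."""
    upper = min(k, n_truth)
    favourable = sum(
        comb(n_truth, i) * comb(n_total - n_truth, k - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(n_total, k)


# Hypothetical ranking of ten genes with three ground-truth genes.
ranking = ["APOE", "G2", "TREM2", "G4", "G5", "G6", "G7", "BIN1", "G9", "G10"]
ground_truth = ["APOE", "TREM2", "BIN1"]
hits = top_k_hits(ranking, ground_truth, k=3)
print(hits, enrichment_p_value(len(ranking), len(ground_truth), 3, hits))
```

A small tail probability indicates that the top of the ranking contains more ground-truth genes than a random ordering would be expected to produce.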